Improved parallelization techniques for the density matrix 

renormalization group 

Julian Rincon*, D. J. Garcia, and K. Hallberg 

Centra Atomico Bariloche and Instituto Balseiro, Comision Nacional de Energia Atomica, CONICET, 8400 Bariloche, Argentina 



'Abstract 



A distributcd-memory parallelization strategy for the density matrix renormalization group is proposed for cases where 
"correlation functions are required. This new strategy has substantial improvements with respect to previous works. A 
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scalability analysis shows an overall serial fraction of 9.4% and an efficiency of around 60% considering up to eight nodes. 
Sources of possible parallel slowdown are pointed out and solutions to circumvent these issues are brought forward in 
order to achieve a better performance. 
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1. Introduction 

■ The impact of numerical methods in the study of phe- 
nomena which are hardly understood by means of ana- 
'lytical machinery has been decisive. Hence, the current 
algorithms ought to be constantly assessed regarding the 
emergence of new concepts and the increasing computing 
technology. Nowadays one of the most successful algo- 
"rithms dealing with one-dimensional interacting systems 
is the so-called Density Matrix Renormalization Group 
■(DMRG) ^. Although this method is not, strictly speak- 
ing, a renormalization procedure, the key idea is the deci- 
mation of the Hilbert space by appealing to the concept of 
the reduced density matrix. This fundamental concept has 
'permitted implementing the DMRG to an extensive vari- 
ety of systems and physical problems such as small grain 
physics, classical 2D systems, nuclear physics, quantum in- 
formation, quantum chemistry, bosonic and fermionic de- 
grees of freedom, and spin systems, together with finite 
temperature and non-equilibrium problems 0, Q . 
' In most of the interesting physical situations, one has 
,to deal with very large systems in order to prevent, for 
•instance, finite-size effects. This fact leads unavoidably 
,to exhaust single-machine resources. Additionally, as the 
dimension of the problem increases, the computational 
costs become more demanding. Bearing this in mind, it 
seems natural to request for a distributed kind of calcu- 
lation. Earlier proposals consisted on shared-memory ap- 
proaches [4] for the DMRG: this method was based on 
the multithreaded API (Application Programming Inter- 
face), namely, OpenMP [5|. Distributed- memory versions 
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of DMRG have been recently proposed in several con- 
texts [1, 0, H, For DMRG calculations in quantum 
chemistry very powerful parallel algorithms have been pro- 
posed with two basic approaches: (i) the clever distribu- 
tion of the local, doubly, and triply contracted orbital op- 
erators with an almost linear speedup Q, or (ii) the dy- 
namical scheduling of the sub-blocks of the orbital opera- 
tors labeled by their corresponding quantum numbers 0] ■ 
Concerning strongly correlated systems, there have been 
a few solutions to handle two-dimensional geometries by 
coding a parallelization that converts the superblock vec- 
tors into distributed matrices 8], or a generic version of 
a one-dimensional DMRG including a parallelization over 
symmetry-related matrix blocks Q. 

The main idea behind these methods was to parallelize 
the central operation of a ground-state DMRG simulation: 
the matrix-vector multiplication in the diagonalization of 
the superblock Hamiltonian. However, the scheme does 
not take into account calculations of measurements such 
as expectation values, multiple-point correlation functions, 
and structure factors, for which the most time-consuming 
part of the algorithm is the huge amount of matrix-matrix 
multiplications (i.e. density-matrix rotations) of the oper- 
ators one is interested in. In addition, the shared-memory 
scheme would already show scalability problems in a large- 
scale computation including the calculation of such phys- 
ical quantities. 

In this work, in addition to receding the ground- 
state DMRG in the well-known passing message standard 
MPI [Tol l (henceforth regular parallelization), we propose 
an improved strategy that takes into account the heavy 
rotations associated to the calculation of the correlation 
functions; this policy is also implemented in MPI allowing 
us to perform genuine high-performance simulations 



Preprint submitted to and accepted in Computer Physics Communications 



April 20, 2010 



System 



Environment 



B{£,m) 



Figure 1: DMRG block configuration. The superblock is formed with 
two blocks (B and B) and two (exact) sites. Added sites a and 5 
are shown as circles. Dashed lines represent system and environ- 
ment blocks. The tilde on the right block means that no reflection 
symmetry has been assumed. 



Two approaches to deal with these rotations are proposed. 
The first strategy is based on a pool of tasks in which 
there is a master node distributing queues to the rest 
of the slaves. The second application performs a block- 
fashion single distribution considering all nodes with an 
equal amount of work, hereafter the uniform-matrix dis- 
tribution (UMD) strategy. The latter is easier to imple- 
ment and more efficient than the former. We obtain simi- 
lar results for the speedup and performance to previously 
reported ground-state DMRG simulations with OpenMP. 
The chosen benchmark was the one-dimensional Hubbard 
model 
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In the forthcoming sections the DMRG algorithm will 
be briefly described, then the usual and new paralleliza- 
tion strategies will be presented and speedup/performance 
results are analyzed. Thereupon, an application test on 
the Hubbard model is done to estimate the runtime im- 
provement due to the parallelization ideas of the previous 
sections, and we finally summarize significant concepts. 

2. The DMRG algorithm 

This variational, non-perturbativc and highly accurate 
method [2| was developed as an attempt to solve the low- 
lying energy properties of many-body models that tech- 
niques such as exact and Lanczos diagonalization [12] , nu- 
merical renormalization group (NRG) fl3l or other ana- 
lytical tools could not be able to deal with; moreover this 
method does not have the sign problem that emerges in 
Monte Carlo techniques fl^. It can be considered as an 
improved version of Wilson's NRG for which the states 
kept during the decimation procedure are no longer se- 
lected regarding their energy but instead, they are chosen 
by means of the density matrix, which naturally gives the 
most relevant states to be kept (with respect to, e.g. the 
lowest- lying eigenstate of the whole system). 

The standard configuration used in the DMRG algo- 
rithm is shown in Fig. [1] We assume the following nota- 
tion: B{i, m) a block composed of £ sites with a Hilbert 
space of dimension m and a{£',m') a small added block 
(usually a single site, e.g., for the Hubbard model case: 
i' = 1 and m' = 4). Therefore, the superblock is formed by 
the union of two blocks and two sites as shown in Fig. [1] 
This superblock is built up of two main parts: the sys- 
tem and the environment composed by a block-site each. 



B(i^ m) is a vector space with a completeness relation close 
to but not equal to 1 due to the decimation processQ On 
the contrary, the subspace a is always complete. 

The main goal is typically the lowest-energy (ground) 
state of the superblock Hamiltonian H which can be writ- 
ten as 

i J 

where and stand for the orthonormal basis for 

the system and the environment respectively and V'o,ii = 
(i (8) j'lV'o)- A truncation procedure should be now estab- 
lished in order to get manageable Hilbert spaces. To this 
end, DMRG resorts to the reduced density matrix of the 
system: 
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(2) 



This matrix possesses non-negative eigenvalues Wa with 
eigenvectors \wa) {p\wa) — Wa \wa))- It can be shown Q 
that these eigenvalues are proportional to the probability 
of the system being in the state \wa)- Selecting the corre- 
sponding eigenstates which have the largest probabilities 
Wa, we can set a cutoff such that we have a very efficient 
decimation formula. This error source can be quantitative 
described by defining the truncation error 
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where m is the cutoff, a truncation number selected of- 
ten by hand. It can be shown 0, [sf that the error in the 
ground state goes as IHV'o) ~ l''/'o)lP = £p where l-^o) is the 
DMRG approximation to the exact ground state. A sim- 
ilar bound can be found for the expectation values. It is 
also shown that the energies obtained with DMRG will be 
upper bounds on the exact eigenvalues. From Eq. ([3]) it 
is evident that the more states are kept the higher the ac- 
curacy of the calculated energies and observables will be. 
Another (generally smaller) source of error in ji/'o) is due 
to the iterative method used to diagonalize the superblock 
Hamiltonian. As a consequence of the Hilbert space trun- 
cation there is an environmental error which has to do 
with the fact that the bath coupled to the system is not 
exact. The environmental error can be reduced by imple- 
menting the so-called finite system algorithm. 

The arrangement shown in Fig. [T]is usually used in two 
ways: on one hand, the infinite system algorithm in which 
the superblock size is grown by adding two new sites in 
the middle of the chain at each iteration step. And on 
the other hand, the finite system algorithm is designed to 
calculate highly accurate properties of the superblock at 
a given lattice length. It consists on moving back and 



^ For the reader not familiar with DMRG, blocks and sites can 
be thought of as vector spaces on which there are certain conditions 
for well-defined states and operators. 
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forward (sweeping) the division between system and envi- 
ronment (it can be thought of as a thermahzation of the 
system and environment blocks). 

All these steps can be summarized in the following way: 

1. Start with left and right blocks as exact single sites. 

2. Diagonalize the superblock Hamiltonian H defined on 
[B{i, to) a a B{£, in)] to obtain \tpo). 

3. Build up all of the block operators related to H and 
measurements defined on [B a] = [B ® a]. 

4. Define and diagonalize p in the system. Find the ro- 
tation matrix TZ = {\wi)\w2) ■ ■ ■ \wm))'^ formed from 
the TO largest eigenvalues Wa of p. 

5. Perform the decimation and rotation step [B] i — 
TZ [B a] TZ^ for the operators defined in step 3. 

Go to step 2. 

When the desired system size has been achieved, mea- 
surements of the relevant quantities such as structure fac- 
tors, spin and charge gaps, binding energies, etc. can be 
performed. 

Since our main concern is the computation of n-point 
correlation functions for several operators Z, we have to 
provide a form for such matrices. This type of simulation 
can be included in the standard algorithm just manag- 
ing those Z operators in the same way as the superblock 
Hamiltonian operators are handled, that is, by doing the 
transformations of blocking Z^^ < — -^[seo] ^^nd then 
the rotation and the decimation step Zj^j < — TZZ^^ a]^^ ■ 
All of the operators Z are managed as block matrices in- 
stead of as block-site matrices reducing the consumed com- 
putational resources and saving time on I/O operations. 



2.1. Benchmark 

We have tested the parallel algorithm with a simulation 
of the one-dimensional quarter- filled Hubbard model 11 1. 
The Hamiltonian of the model reads: 



H 



Ci(j ~t~ C, 



^Cj+1ct) 



(4) 

where Cia (c^) denotes an electron annihilation (creation) 
operator on site i with spin a = (t,i). Here, Cia is an 
m X m matrix. Regarding storage effects, a implies two 
different matrices for Ci^ for each site i. t and U are pa- 
rameters standing for electron hopping and on-site electron 
repulsion respectively. 

The charge N{q) and spin S^{q) structure factors 
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(5) 



k,3 



were calculated, the number operator is Uia 
Tii — nif -l- nj^, and n is the charge expectation value. 
As it can be seen, obtaining these two quantities requires 



the calculation of the expectation values (ui) and all of 
the charge-charge {uiUj) and spin-spin (S^Sj) correlation 
functions. 



3. Parallelization 

There are two main architecture paradigms in paral- 
lel computing: systems with a single address space called 
shared-memory systems allowing multiple processors to 
access the same memory location (data) and distributed- 
memory systems in which each processor has its own ad- 
dress space and therefore its own data structure. Both 
paradigms can be successfully applied to the DMRG me- 
thod 0, 0] . Earlier distribution strategies worked well on 
a shared- memory system methodology; nevertheless, this 
type of architecture eludes a massively parallel approach. 
Consequently, a distributed-memory policy should be de- 
veloped in order to get a coarse-grain scheme reaching 
larger lengths and more states per block using modest 
computational resources. Here, in addition of putting for- 
ward a new parallelization scheme, we have changed the 
shared-memory (OpenMP) approach to a standard mes- 
sage passing API (MPI) [10|. 

As we will show below very similar results are obtained 
to the OpenMP case with the possibility of improving scal- 
ability properties. This distributed approach has the ad- 
vantage of avoiding collisions (present on MP algorithms) 
at the presumable cost of using more resources and larger 
communications. The calculations presented in this work 
were performed using a cluster with Intel® Xeon 2.50 GHz 
CPU cores (with a memory of 1 GB per node) arranged 
either as a double quad-core system or as single cores 
in a star topology network with a nominal bandwidth of 
900 Mb/s. 

Let us now briefiy summarize the analytical apparatus 
needed to study the speed of a high-performance realiza- 
tion [1^]. The speedup Sp indicates how much faster a 
parallel code on a p-node process is with respect to the se- 
quential analogue. Sp is explicitly defined as the fraction 
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(6) 



where Ti and Tp are the wall-clock times of the simulation 
with 1 and p processors respectively. The ideal speedup 
should scale linearly with p, that is, S^^°^^ — p. Another 
quantity of interest which illustrates how much the algo- 
rithm is exploiting a single processor is the efficiency which 
reads 

Ep = ^. (7) 

p 

In the simplest model, the sequential time of a program 
(normalized to 1) can be split into a serial fraction E and 
a parallel fraction 1 — E. With a finite number of nodes p, 
the parallel fraction gets reduced by (1 — T,)/p; based on 
these considerations we obtain Amdahl's law 1161 for the 
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relative speedup 



Sa{p) 



1 - s 



(8) 



thus, the maximum speedup achievable (i.e. with p — >■ oo) 
would be S'yi — >■ This amount gives us a rough idea of 
the expected efficiency in a distributed implementation!^ 

3.1. Regular Parallelization: Ground- state DMRG 

It is well known that the most time-consuming part in 
the ground state DMRG is obtaining the lowest eigen- 
value of the superblock Hamiltonian H by means of an 
iterative procedure (such as Lanczos [13] or Davidson 
algorithms). Since H is actually a sum of terms involv- 
ing left {system formed hy B®a) and right [environment 
formed by a © B) matrix products, we can readily write 
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Figure 2; Speedup scaling of a DMRG calculation for the ground 
state. Circles correspond to the Davidson algorithm (step 2 in sec- 
tion[2ll performance for the 9 index distribution case (S = 8.6(2)%). 
Triangles correspond to the total DMRG calculation of the ground 
state (S = 11.5(3)%) with the corresponding Amdahl's law (full 
line). Ideal scaling is included for comparison (dashed line). 



where represents a generic operator defined on any of 
the blocks {X — B, a, a, B) and A corresponds to each of 
the terms in Eq. (|4]). Typical terms are for instance, the 
hopping term between the left block and left site: c^Cq = 

Cg^Ca® la 

for the sites belonging to B. Ix stands for the identity on 
the space X. 

If the implementation incorporates symmetries, such as 
particle number or total magnetization, then H takes the 
form 



3 or the right block Hamiltonian: H-^ = 
Hb which should contain all of the H terms 



i/ = 5] 5] (0^ ) ® (r ) (r ) ® (^^ ) , ( 10) 
A e 

explicitly showing that the operators {9-^ ) are labeled 
by their quantum numbers. The value 6-^ is a symme- 
try index of the X block, and 9 is an index running over 
the superblock basis formed by the configurations with the 
quantum number 9-^ + 9"^ + 9°" + 9^ fixed. Using symme- 
tries helps to minimize the size of nested loops. Usually H 
is a very large matrix (e.g. with dimension A4 ^ 10'* — 10^), 
thus it is never explicitly constructed but rather consists 
of multiplication rules. This means that given a vector \b) 
we get the iJ-multiplied result H\b). 

We shall now get into the aspects of the parallelization 
idea. There is a basic tactic without handling the matrix- 
vector multiplication which would be that of distributing 
only the Hamiltonian terms mentioned above, that is, the 
A index in Eq. ([TUl) . Explicitly, one node will deal with 
Hb, another node will address the c^c^ term, and so on. 
However, this plan is prone to poor scalability showing 
parallel slowdown already for 6 nodes with a speedup of 



only 1.5. This slowdown is perhaps due to load imbalance 
since not all of the Hamiltonian terms involve the same 
number of operations. The site-site interaction consists 
only of a few logical rules, but terms such as block-site or 
site-block have to iterate over tensor products. Even when 
we compare these last two terms there is also an imbalance 
because of roaming over fast and slow matrix indices. 

A more efficient option consists of the distribution over 
the central {9 = 1, . . . , M) loop of the matrix- vector mul- 
tiplication on the diagonalization algorithm (Davidson in 
our case). Each task will apply the full H to [M/p\ states 
and the first mod {A4,p) tasks will handle an extra state|f| 
We do not distribute the sub-blocks of the relevant oper- 
ators labeled by their quantum numbers because of their 
dissimilar dimensions. With this strategy, we get values 
of speedup of 3.5 in an 8-node process with a serial frac- 
tion of 11.5%. To achieve an even faster realization when 
distributing over the 9 index, one should also share out 
all of the linear algebra (daxpy, ddot, dscal, and dcopy) 
operations in the Davidson algorithm. These operations 
include orthonormalizations, inner products and the nor- 
malizations of the vectors added to the Davidson basis 
expanding the ground state IV'o)- In doing so, we have 
now moved up the speedup to 4.9 on 8 nodes (S = 8.6%). 
The scalability properties of the distributed version of the 
DMRG calculation for \ipo) are shown in Fig. [2] The load 
imbalance in this case goes as p/A4 which is negligible for 
actual DMRG simulations. 

The performance properties of Davidson parallelization 
are strongly affected by the reduction operations of the 
matrix-vector multiplication, hence the better the imple- 
mentation of these the better the speedup will be. This 
leading behavior could be diminished by ordering the su- 



^ This fixed-sized problem law neglects important effects such as 
overhead, cache effects, network latency, etc. 



[x/y] meaning the integer division and mod{x,y) stands for 
the modulo operation with x and y real numbers. 
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perblock basis properly. This way, all of the reduction calls 
of order M are optimized by calls of order M./p or less. 
To show this, we have used a test block-diagonal matrix 
that does not require any reduction calls at all in the ap- 
plication of H. By doing this, we have obtained a serial 
fraction of S = 0.87(2)% (down to 20 processors) on the 
Davidson scheme, whereas when we consider the Hubbard 
Hamiltonian, we get a serial fraction of 8.6% as a result. 

The most simple distribution one can think of was imple- 
mented in the rotation (decimation) of the operators rele- 
vant to H (such as c-|-, cj, for the Hubbard model case), that 
is, a row-distributed matrix-matrix multiplication. The fi- 
nal result is a serial fraction of 25% for this section of the 
algorithm. The reader should remember that Amdahl's 
law is a very simplistic proposal on the performance of 
a parallelized algorithm; serial fractions allow us to easily 
understand the results and what to expect of a distributed 
version of the serial code. 

In order to better understand the performance obtained, 
we now make a comparison between our MPI implemen- 
tation of the ID Hubbard model and the shared-memory 
(OpenMP) version of the 2D Hubbard model jl]. The 
whole DMRG performance of the MPI implementation 
shows a better behavior than in the shared-memory ver- 
sion (E — 11.5% compared to E = 16% [4]) in spite of the 
fact that the Davidson algorithm results are not as good as 
previous ones (E — 8.6% compared to E = 6.5% [^]). This 
improved behavior could be related to the additional paral- 
lelization of the linear algebra operations mentioned above, 
added to the absence of collisions (and despite message 
passing) on the MPI algorithm or better communications 
originated on newer hardware improvements. Even though 
this comparison is not strictly valid because we are deal- 
ing with different geometries (ID versus 2D Q Hubbard 
models), we must remark that our case is the worst case 
scenario. In ID we have fewer Hamiltonian terms, mean- 
ing fewer independent processor operations in comparable 
Hilbert spaces with a similar amount of communications. 
This would suggest that for a more complex Hamiltonian 
(e.g. including longer range hoppings or different geome- 
tries such as 2D) our result for the serial fraction will be 
even smaller. 



Table 1: Relative runtimes and serial fractions percentages at dif- 
ferent steps of the algorithm. The unparallelized time (item d) is 
mainly consumed in building the system (or the environment), the 



density matrix and getting its spectrum. 


Step 


Time (%) 


E(%) 


a. Davidson algorithm 


19.7 


9.4(1) 


h. Zi and ZiZj rotations 


72.3 


8.1(3) 


c. H operators rotations 


0.1 


25(2) 


d. Unparallelized sections 


0.5 


100(0) 


e. Measurements 


7.4 


7.6(7) 


/. Total calculation 


100 


9.4(1) 



3.2. Novel Strategy: Correlation operators 

If n-point correlations are required, the former distribu- 
tion setup turns out to be insufficient because the ground 
state determination is not the longest time-consuming 
part anymore and is overtaken by the operator decima- 
tion and rotation (see Tabled]). Therefore a new approach 
is mandatory to deal with that issue. The new strategy 
should take into account that the most time-expensive part 
is in this case the double matrix operation of the corre- 
sponding operators Zi and ZiZj (e.g. for TZZiR,'^: ZiRf^ 
and then Tl{ZiTZ'^)). Typical correlation functions are the 
one-point and two-point functions [2| , namely, 

(Z,) = (V^ol^dV'o), ^j^^^ 
(ZiZ-j) = (iJjolZ^Zjlijo) 

with i,j — 1,...,L and L the length of the superblock 
chain. The number of (stored) matrices to be rotated 
(see section [5J last step) at a given length calculation is 
C = £{£ -I- 3)/2 {£ matrices coming from single-site opera- 
tors Zi and £{£ + l)/2 coming from two-point correlation 
functions ZiZj with i < j), with £ being the number of 
sites of the system or environment according to forward or 
backward sweeping. The correlations between the B and 
B blocks were calculated as a product of single-site oper- 
ators in each block. The specific tasks involved in step 5 
(see section[2) are: («) the reading of the current matrix Zi 
from storage, (m) the blocking step Z^bo] ^ — -^[Beajj (iH) 
the two matrix-matrix products with the rotation matrix 
TZ, and {iv) the corresponding saving of the new matrix 

We shall show below two ways to address this issue: 
a pool of tasks Is*] and what we have called a uniform- 
matrix distribution (UMD) parallelization. In this latter 
strategy every node has almost the same load (see below) 
without a master node. The UMD parallelization seems to 
have a better output because it has fewer communications 
(only at the very beginning of the subroutine) and takes 
more advantage of the nodes available during the calcula- 
tion (see below). The pool of tasks is a more elegant and 
common solution but in practice, a slower option. The 
speedup results for these two parallelized DMRG calcu- 
lations of correlation functions are shown in Fig. |31 The 
efficiency for the UMD case is shown in Fig. |4l 

In the pool of tasks paradigm fis'l , the data to be pro- 
cessed (the Z matrices) are divided into small units with 
similar structures called tasks. All of these tasks form the 
so-called task pool. One node, the master process, manages 
this large amount of tasks, always sending to idle workers 
more work to do until all of the tasks have been executed 
(empty pool). This model is effective in situations where 
the available nodes have very different technical specifica- 
tions, because the least loaded or more powerful hosts do 
more of the work and all of the hosts stay busy most of the 
runtime. The serial fraction obtained in this implementa- 
tion was about 11.1% (see Fig. [31). The optimal result 
depends on the number of tasks in which the whole job 
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Figure 3: Pool of tasks (squares) and UMD parallelization (triangles) 
performance for step 5 in section [2] The values of the serial fractions 
were E = 11.1(2)% and S = 8.1(3)% respectively. The speedup 
factor corresponding to the total DMRG calculation using the UMD 
technique for the calculation of the correlation functions (circles) was 
E = 9.4(1)%. Its corresponding Amdahl's law is also included (full 
line) and the ideal scaling is shown for comparison (dashed line). 



is divided. If this number is too small, parallel slowdown 
will already appear. In addition, the greater the number 
of tasks the bigger the amount of communications will be. 

Let us now explain the UMD technique. This distribu- 
tion proves to be easier to code and more efficient than the 
pool of tasks. The key idea is to keep all of the processors 
on the same working settings so we can take full advantage 
of the accessible hardware. The distribution is performed 
in terms of blocks of contiguous local and non-local oper- 
ators. If the number of processors is p then each processor 
stores [yC/pJ operators, except maybe the first mod(£,p) 
ones that will store \_C/p\ -1-1 matrices. Load imbalance in 
this case goes as p/ C, which is imperceptible for larger lat- 
tice lengths, i.e. larger C. The serial fraction has now been 
improved to S = 8.1% (in the double quad-core system) 
as shown in Fig. [31 

There are many more communications in the pool of 
tasks compared to the UMD case. These communications 
are related to petitions coming from the workers involving 
statuses such as: "task done" and "ready to work"; and 
the complementary messages sent by the master node with 
the proper information about the task to be made. On the 
contrary, the UMD settings just need very few communica- 
tions that keep track of the set of operators to be handled 
by each node. This message passing should be posted at 
the beginning of the corresponding iteration. 

In both parallelization policies, if a given node demands 
a specific set of matrices that is not currently in local stor- 
age, an implemented queue manager handles this type of 
requests by sending the matching operator. This is done 
by means of a book-keeping of the matrices and its current 
owners throughout the entire cycle. Hence, when all of the 
desired matrices have been shipped, a new-owner message 
should be broadcasted to the rest of the active processors. 
The rotation matrix TZ is replicated along all of the nodes. 




Figure 4: Efficiency plot for the UMD case shown in Fig. [3] with 
the same symbol convention. The shaded region corresponds to the 
cases where no speedup is gained compared to the p = 1 case. 



This procedure allows each processor to save runtime by 
storing the new operators locally. For instance, if at some 
point through the simulation a processor, say, number 1 
requests an operator that in an earlier step was assigned 
to processor, say, number 2, the queue handler transfers 
the required matrix from processor 2 to the corresponding 
node making an update of the owner matrix-bookkeeping. 
This procedure does not affect the task being performed 
by processor 2 avoiding synchronization delays. For the 
UMD case we have found a serial fraction of E = 9.4% in 
the star topology network. 

The origin of the serial fraction of the presented par- 
allelization schemes is perhaps due to the following fac- 
tors: processes contending available cache space, racing 
conditions linked to the storage of the corresponding ma- 
trices, or the transfer of the requested data between pro- 
cesses. In order to reduce the total serial fraction of the 
whole process attention should be paid to the rotations 
of the operators (item h in Table [T]), the Davidson algo- 
rithm (item a), and the unparallelized sections (item d). 
The measurements are discussed below. As for item 6, the 
most time-expensive of all of the four steps at this point 
(addressed at the beginning of this subsection) would be 
consecutively: the two matrix-matrix products, the writ- 
ing of the outcome to disk, the reading of the input from 
disk, the blocking operation and, in the star-topology case, 
the matrix copying among nodes. Unavoidable points are 
probably the I/O operations, the matrix multiplications, 
and the optimized blocking due to the use of symmetries. 
Therefore the candidate stage to be improved is the data 
transfer protocol (ssh-server) for the networking case. 
Using a socket-type communication or a remote server will 
certainly enhance the achieved speedup. As for the David- 
son step, in all of the strategies, one could try to reduce the 
few synchronization calls with the consequence of having 
more local operations. And finally, the total serial frac- 
tion could be reduced further if some kind of paralleliza- 
tion scheme is implemented in the unparallelized section 
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of item d. 

There is a small discrepancy between the values of the 
serial fractions shown in Fig. [5] and Table [1] item a (with 
and without correlations) for the Davidson part. This 
may be due to the effect of the compilation when cor- 
relations are included. However, the values are compati- 
ble within the numerical error. Now, taking into account 
the Davidson diagonalization, as well as the Hamiltonian 
operators and the rotation of the operators to be mea- 
sured, we should get a weighted average serial fraction of 
S^totai = 8.8% as for the parallelized sections, but due to 
the unparallelized fraction of the code (item d) the final 
serial fraction is actually 9.4%. Finally, the correspond- 
ing distribution was done for the measurement part in 
the same way as for the distribution over the 9 index in 
Eq. (fTU)) . with the exception that the A^-size vector reduce 
calls have been replaced by single-data reductions associ- 
ated to the partial inner products (t/ioI^IV'o)- The serial 
fraction for this section of the algorithm was S = 7.6%. 
This value is probably related to the reading of the Z ma- 
trices from local or remote storage depending on the final 
Z-bookkeeping. It should be mentioned that this is just 
a minor optimization compared to the whole calculation, 
but it is rather straightforward to code this section of the 
DMRG algorithm once that of the Davidson diagonaliza- 
tion has been implemented. 

To estimate the performance of each node as compared 
to communication times, we show in Fig. |4] the parallel 
efficiency of the whole process in the UMD case. This 
quantity shows a very nice behavior up to the number 
of nodes used. For the p = 8 case Ep is around 60% 
meaning that each processor is actually working more than 
half of the total computational time. It also shows the 
good reliability of the parallelized algorithm suggested in 
this work. Parallel efficiency of a single-CPU is shown 
for comparison (continuous line). An improvement in the 
overall efficiency was observed when the number of states 
kept was increased (to = 400 — 1000), as expected from 
a non-fixed-sized parallel problem 19^. For instance, for 
m = 1000, Ep is increased by 20% for p = 8 with an overall 
serial fraction of S = 5.5%. It should be noticed that the 
more operators are measured the more effective this novel 
strategy will be. 

Simulations of ladder-type systems have shown that the 
ratio of runtimes between Davidson diagonalization and 
the rotation of the operators is not as remarkable as in the 
one-dimensional case. However, for long enough systems, 
the time of the rotation of the operators will be a signifi- 
cant part of the total time justifying the implementation 
of the present parallelization strategies. The change of 
the Davidson runtime stems from the increasing number 
of terms of H as pointed out in the previous subsection. 

Lastly, in order to reproduce well-known results for the 
S'^{q) and N{q) structure factors ^2^, we have performed 
serial and distributed numerical simulations for a quarter- 
filled one-dimensional Hubbard chain of L = 128 sites with 
m — 400 states per block and an interaction parameter 



U/t = 8. Two sweeps for the finite-size algorithm and 
open boundary conditions were imposed in the calculation. 
The truncation error was ep ^ 10~^. The total runtime 
on a 1-node process was about 165 hours compared to, for 
instance, 33 hours on an 8-node process. 

4. Conclusions 

We have presented an efficient parallelized version of 
a DMRG code devoted to the calculation of n-point cor- 
relation functions. Unlike previous approaches, the cur- 
rent strategy was implemented in a passing message con- 
text (MPI) allowing for a better performance than for 
the shared-memory scheme. The overall serial fraction of 
the whole process was about 9.4% and the efficiency was 
around 60% up to eight nodes. In spite of the fact that 
our parallelization scheme does not scale well to hundreds 
of nodes it does allow simulations not reachable by serial 
coding with a maximum speedup of 1/E = 10.6 according 
to Amdahl's law. Causes of parallel slowdown were ad- 
dressed and possible ways of decreasing the serial fraction 
were presented. 
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